Molecular patterns identify distinct subclasses of myeloid neoplasia

Genomic mutations drive the pathogenesis of myelodysplastic syndromes and acute myeloid leukemia. While morphological and clinical features have dominated the classical criteria for diagnosis and classification, incorporation of molecular data can illuminate functional pathobiology. Here we show that unsupervised machine learning can identify functional objective molecular clusters, irrespective of anamnestic clinico-morphological features, despite the complexity of the molecular alterations in myeloid neoplasia. Our approach reflects disease evolution, informed classification, prognostication, and molecular interactions. We apply machine learning methods on 3588 patients with myelodysplastic syndromes and secondary acute myeloid leukemia to identify 14 molecularly distinct clusters. Remarkably, our model shows clinical implications in terms of overall survival and response to treatment even after adjusting to the molecular international prognostic scoring system (IPSS-M). In addition, the model is validated on an external cohort of 412 patients. Our subclassification model is available via a web-based open-access resource (https://drmz.shinyapps.io/mds_latent).

In the current study, the authors utilized an unsupervised clustering approach based on autoencoders coupled with Gaussian-mixture modeling (GMM), which is one of the best methods for unsupervised clustering. This GMM method could rationally divide more than 3000 MDS and MPN patients into 14 clusters based on specific genetic alterations. Patients from each cluster also showed distinct clinical characteristics and prognosis. Further, the authors divided the 14 groups of patients into five combined risk groups according to their survival information: low, int-low, inthigh, high and very high. This new risk stratification was further validated by an external patient cohort. Finally, the authors built a website for predicting the clusters of patients from the real world. Overall, there is a big patient cohort included into this study (over 3000 patients) and therefore the conclusions are relatively convincing. Although this research is of clinical interest and suggests a potentially novel risk stratification system for MDS and MPN patients, I have several major concerns regarding applicability of the presented data: Major: 1. Table 2 reflects the similarities and differences of various clinical parameters in the 14 clusters. For some factors, such as platelet and diagnosis, distinct characteristics could be observed. For example, cluster 4 showed the highest platelet count and cluster 11 showed the lowest level. Cluster 6 and Cluster12 showed the highest proportion of CMML as compared to other clusters. These data are interesting but not clearly visualized or emphasized based on the current version. If these data could be displayed with a figure and analyzed by statistics, it could further help readers understand the similarities and differences in the clinical parameters among 14 clusters. Figure 3B-D the authors divide the 14 clusters of 2806 patients into five combined risk groups, this novel risk stratification system is not used in the web tool. In the current version, the web tool is based on 14 individual clusters instead of risk stratification groups, which is less suitable for the clinical use. Preferably, the web tool could be organized similarly to IPSS-M calculator, that is, after entering the cytogenetic and molecular information of a patient, the corresponding risk stratification can be directly calculated rather than genetic clusters.

Although in
3. The clusters in the web tool are numbered in a different way when compared to the manuscript. For example, if we use Normal Karyotype and SF3B1 mutations as an input, the tool shows distribution probability for clusters 0-13, whereas cluster numbers 1-14 are used in the manuscript. Moreover, described above input settings result in the highest probability for cluster 13 identified in the manuscript by TP53 mutations and Complex Karyotype. The authors should correct discrepancies between clustering in the manuscript and online tool.
4. The current IPSS-M risk stratification contains molecular genetic information on 31 genes, which show the strongest prognostic significance in MDS patients. Therefore, the reviewer feels that it is important to show the proportions of different IPSS-M risk groups among 14 clusters or 5 prognostic risk stratifications as presented in Figure 1C, rather than only being limited to IPSS-R. Based on this comparison, the similarity and difference between the two models can be further explored. Furthermore, the current study would be of higher scientific and potential clinical value if the authors quantitatively compare the performance of their own model (reduced to 5 prognostic risk stratification groups) and the IPSS-M model. For example, receiver operating characteristic (ROC) curve, Calibration curve and Decision curve analysis (DCA) could be performed. 5. The stratification system containing 14 clusters is not designed to compete with the current IPSS-R or IPSS-M system as stated in the discussion. How will it be implemented in the MDS risk stratification in the clinical settings? In which situations can this system be applied?
6. Some of the clusters generated by the authors in the study are characterized by similar genetic signatures. For example, clusters 10 and 12 are characterized by Normal Karyotype and TET2 mutations with 100% frequency. If a random patient with Normal Karyotype and TET2 mutation without additional genetic lesions is entered into this system, to which cluster will it be assigned? Importantly, cluster 10 is grouped into low risk group, whereas cluster 12 is assigned to Int-High risk group. Unfortunately, the current version of the web tool does not provide an answer to this question. This also raises the question of whether additional information, including blast % and data on cytopenias, are required in these cases for more precise stratification.
7. The authors mentioned that their clustering model was able to highlight survival differences among IPSS-R risk groups (Supplementary Figure 11). However, these survival differences are subtle. Moreover, groups with higher risk (groups 5 and 2) show superior survival inside low-IPSS-R and intermediate IPSS-R risk groups. Visually, some survival curves suffer from low sample size. Therefore, the information about patient numbers for each group and numbers at risk are needed for this Figure. 8. In Figure 3E, the authors compared the proportion of treatments in the 14 clusters. It would be very interesting if the authors could further compare the response of patients to different treatments in each cluster. Minor: 1. In tables 1 and 2, the authors should describe how they identified low(er)-risk and high(er)-risk MDS patients. Was it done based on IPSS-R risk stratification? In this case, in which category were intermediate risk patients assigned?
2. Did the authors perform statistical analysis for MC1-14 shown in Figure 1D? For example, the data suggest that there could be differences for MC8 and MC12 in original and validation cohorts. Figure 3B and Supplementary Table 8 (e.g. low risk = 1, very high risk =5) Table 7. Are those data from the validation cohort and why median survival times are not available for many clusters? It would be helpful to provide patient numbers (n) for each cluster and present the data for training set in parallel.

It is not completely clear what is shown in Supplementary
5. It was also not clear to the reviewer why HR values for groups 2 and 3 in comparison to group 1 (low risk) are below 1 (Supplementary Table 9). There are also very big CIs despite very low pvalues even for training set.
6. Some figures are not presented in the order described in the text. For example, the authors described Figure  1. Please provide more details about how the model is developed and evaluated in the supplementary methods section, including the following information: What is the binary mutation profile? is it a variant or gene-level profile? How is it derived from the WES data? And should it be defined clearly in the manuscript?
How was the RF model trained? It is claimed in the Supplementary Methods section that the input is the binary mutation profile. However, features like Normal KT, Complex KT, Other KT were seen in many feature importance plots, such as Supplementary Figure 3 and 4.
2. A baseline model could be provided and compared with the model proposed. This will show the necessity of using this complex model involving autoencoder, GMM, hierarchical clustering, and RF.
3. With the current model evaluation results, I'm not convinced about the robustness.
The similar number of clusters in each fold (in Line 151) is not enough to show the robustness.
The calculation of the ARI is not clearly described in the manuscript or supplementary materials.
According to the R script named by cluster_folds.R in the GitHub page provided by the author, ARI here is not fair to show the consistency of results.
When calculating the ARI between Fold-1 and Fold-2: ARI_1,2=ARI( {labels of set 2,3,4,5 predicted by Fold1}, {labels of set 2,3,4,5 predicted by Fold2} ) But set 3, 4, and 5 were training data in both Fold-1 and 2. The cross-validation evaluation results shouldn't contain shared training samples. Thus, I suggest the author to calculate the ARI between Fold-1 and Fold-2 by: ARI_1,2=ARI( {labels of set 1 predicted by Fold1}, {labels of set 1 predicted by Fold2} ) What is the statistical test for "no significant differences" in Line 156? 4. Please provide summary figure or statistics for the claim in Line 171 "Blast percentages in MCs were consistent with the risk distribution of cases, and the median blast percentage was consistent with the composition of each MC." The current results provided are not clear for this conclusion. Figure 11 and the example in the very low IPSS-R group, please provide more evidence to rule out the confounding factor effect of IPSS-R in studying the overall survival differences among MCs, since it was mentioned in Line 167 that "the distribution of different revised international prognostic scoring system (IPSS-R) risk groups among our MCs were distinct and heterogenous." This is essential in determining the novelty of the findings in OS differences. Minor:

Besides Supplementary
1. Figure 2 is rarely discussed in the manuscript. Should the author attach more writing to the results if it is essential, or put it in supplementary figures and take other plots to the main figure?
2. The color palette consistency In Figure 1, please make sure each label has only one color. E.g., A and C have different label colors for LR-MDS, HR-MDS, and sAML. The five risk group labels are inconsistent between Supplementary Figure 10A and others.

RESPONSE TO REVIEWERS' COMMENTS
Reviewer #1 (Remarks to the Author): expertise in AML genomics Major: 1. Table 2 reflects the similarities and differences of various clinical parameters in the 14 clusters. For some factors, such as platelet and diagnosis, distinct characteristics could be observed. For example, cluster 4 showed the highest platelet count and cluster 11 showed the lowest level. Cluster 6 and Cluster12 showed the highest proportion of CMML as compared to other clusters. These data are interesting but not clearly visualized or emphasized based on the current version. If these data could be displayed with a figure and analyzed by statistics, it could further help readers understand the similarities and differences in the clinical parameters among 14 clusters.
Reply: We thank the reviewer for this valuable comment. To further clarify the significant differences in clinical parameters, including diagnosis and laboratory values, between the molecular clusters, we performed a statistical analysis and compared the clinical parameters between all clusters. We now added P-values to Table- Figure  7D-F). For instance, patients in MC1, MC11, and MC13 had significantly lower platelet counts (median of 87, 48, and 76 10 9 /L, respectively, p-value <0.001) compared to other clusters. The highest median hemoglobin level (11 g/dL) was observed in patients assigned to MC6, whereas patients within MC1, MC8, MC11, and MC13 had lower values (median around 9 g/dL)." Figure 3B-D the authors divide the 14 clusters of 2806 patients into five combined risk groups, this novel risk stratification system is not used in the web tool. In the current version, the web tool is based on 14 individual clusters instead of risk stratification groups, which is less suitable for the clinical use. Preferably, the web tool could be organized similarly to IPSS-M calculator, that is, after entering the cytogenetic and molecular information of a patient, the corresponding risk stratification can be directly calculated rather than genetic clusters.

Although in
Reply: We want to thank the reviewer for this suggestion. We have now updated the web tool to include the risk stratification with the molecular clusters. We have also included additional details in the supplementary material explaining the risk-model generation for the web tool. Specifically, we have included in the supplementary methods the following paragraph: "In order to include the risk stratification to the assigned molecular clusters, we generated a random-forest classifier using the identified risk groups. as well. We then followed a similar approach for hyperparameter tuning. Furthermore, our risk stratification grouping has been added to the web tool to help with the clinical application of our model." 3. The clusters in the web tool are numbered in a different way when compared to the manuscript. For example, if we use Normal Karyotype and SF3B1 mutations as an input, the tool shows distribution probability for clusters 0-13, whereas cluster numbers 1-14 are used in the manuscript. Moreover, described above input settings result in the highest probability for cluster 13 identified in the manuscript by TP53 mutations and Complex Karyotype. The authors should correct discrepancies between clustering in the manuscript and online tool.
4. The current IPSS-M risk stratification contains molecular genetic information on 31 genes, which show the strongest prognostic significance in MDS patients. Therefore, the reviewer feels that it is important to show the proportions of different IPSS-M risk groups among 14 clusters or 5 prognostic risk stratifications as presented in Figure 1C, rather than only being limited to IPSS-R. Based on this comparison, the similarity and difference between the two models can be further explored. Furthermore, the current study would be of higher scientific and potential clinical value if the authors quantitatively compare the performance of their own model (reduced to 5 prognostic risk stratification groups) and the IPSS-M model. For example, receiver operating characteristic (ROC) curve, Calibration curve and Decision curve analysis (DCA) could be performed.
Reply: We thank the reviewer for this relevant and excellent comment. Based on the reviewer's suggestion, we calculated IPSS-M scores for the patients in our cohort (and added explanations as to that in the supplementary methods) and presented the frequencies of IPSS-M risk groups among our molecular clusters and our risk groups ( Figure  1D, lower panel). As we emphasized in the article, our aim was not to provide a prognostic score, but to clarify the possible molecular interactions between different molecular mutations in patients assigned to similar molecular clusters, which for sure will affect prognosis, treatment response, and other clinical/prognostic features. However, to ensure that our machine learning model is applicable clinically and is not contradicting the current well-established molecular/clinical tools, we compared our model with IPSS-M using Harrel's C index (New Supplementary Figures 15 and 16). We also incorporated clinical features (age, sex, blast percent, hemoglobin and platelet count) into our model and compared it with IPSS-M. This was added to the main text and the supplementary materials: "Similarly, very high risk and high risk IPSS-M groups were mainly enriched within MC1, MC9, MC11, MC12, and MC13 ( Figure 1D, lower panel)." AND "Patients assigned to low, moderate low, high and very high IPSS-M risk groups had survival differences based on our model (Supplementary Figure 15)." AND "In order to compare the utility of the proposed model in time-to-event modeling with IPSS-M using OS, we bootstrapped Harrel's C-index differences. Incorporating clinical variables (age, sex, blast percent, hemoglobin and platelet counts) showed no significant difference between IPSS-M and the proposed clusters (Supplementary Figure 16). Overall, our novel model was comparable and significantly overlapping with IPSS-M in which high risk MCs/groups had higher IPSS-M scores (Supplementary Figure 17)." Figure 1D: between different mutations in patients assigned to similar molecular clusters which for sure will affect prognosis, treatment response and other clinical/prognostic features. Our molecular-based approach was efficient in showing differences among patients with similar single mutations. The purpose of such molecular classification is to identify patients with possibly similar sensitivity to targeted agents, and for the clinical investigators to rationally test efficacy of novel drugs in molecularly related sub-entities. The scoring systems benchmarked on survival may group patients with unrelated pathophysiology into the same groups, which if used for indication for specific treatment (low risk disease vs. high-risk disease), may result in heterogeneous response patterns. The molecular clustering would help to avoid such a scenario. We added these points to our discussion: "The purpose of such classification is to identify patients with common features possibly suggesting a similar rate of sensitivity to targeted agents, and for the clinical investigators to rationally test efficacy of novel drugs in these molecularly-related sub entities. The scoring systems benchmarked solely on survival may bundle patients with unrelated pathophysiology into same groups, which if used for treatment indications (low risk disease vs. high risk disease) may result in heterogeneous response patterns. The molecular clustering would help to avoid such a scenario." 6. Some of the clusters generated by the authors in the study are characterized by similar genetic signatures. For example, clusters 10 and 12 are characterized by Normal Karyotype and TET2 mutations with 100% frequency. If a random patient with Normal Karyotype and TET2 mutation without additional genetic lesions is entered into this system, to which cluster will it be assigned? Importantly, cluster 10 is grouped into low-risk group, whereas cluster 12 is assigned to Int-High risk group. Unfortunately, the current version of the web tool does not provide an answer to this question. This also raises the question of whether additional information, including blast % and data on cytopenias, are required in these cases for more precise stratification.
Reply: We thank the reviewer for this great point. We included the following explanation "While both MC10 and MC12 seemed similar regarding features such as TET2 MT and normal cytogenetics, there were differences based on the absence and/ or presence of certain somatic mutations and cytogenetic abnormalities. The frequency of SF3B1 MT was higher in MC10, while MC12 was enriched with ASXL1 MT (Figure 2 and Supplementary Figure 4). Finally, only 7% of patients in MC10 had more than 4 concurrent somatic mutations, a feature characterizing up to 40% of patients in MC12".
In order to translate these complex interaction patterns, we also updated the version of the web tool as mentioned before to present the risk groups as well. We pointed out the molecular signatures of the clusters to make our model more applicable clinically but the importance of each feature in every single cluster identification is highlighted in Figure 2 and Supplementary Figure 4. 8. In Figure 3E, the authors compared the proportion of treatments in the 14 clusters. It would be very interesting if the authors could further compare the response of patients to different treatments in each cluster.
Reply: We thank the reviewer for this clinically relevant comment. We chose response to treatment as one of the features that can illustrate the biological and clinical differences between the molecular clusters (cluster to cluster differences). We only assessed response to HMAs as shown in Figure 3 since it was the most used treatment in our cohort. Chemotherapy and allogenic HSCT were less observed in our cohort, hence precluding any further analysis because of lack of statistical power.

Minor:
1. In tables 1 and 2, the authors should describe how they identified low(er)-risk and high(er)-risk MDS patients. Was it done based on IPSS-R risk stratification? In this case, in which category were intermediate risk patients assigned?
Reply: We thank the reviewer for raising this point. We classified patients into high-risk MDS (HR-MDS) and low-risk MDS (LR-MDS) groups based on the bone marrow blast percent (5% cutoff). We clarified this point in the table legend:                  (LR-MDS) was defined based on BM blast <5%." 2. Did the authors perform statistical analysis for MC1-14 shown in Figure 1D? For example, the data suggest that there could be differences for MC8 and MC12 in original and validation cohorts.
Reply: We thank the reviewer for the comment. Statistical analysis was done to compare the molecular features between our cohort and validation cohorts. No significant differences were found. We clarified this point in the legend of Figure 1: "Significant differences between frequencies based on the Chi-square test are indicated by asterisks. No significant differences in the frequency of the molecular features were found between the original and the validation cohorts within patients assigned to similar MCs." Figure 3B and Supplementary Table 8 (e.g. low risk = 1, very high risk =5) Reply: We thank the reviewer for the comment. Risk group numbers were edited in both Figure 3B and Supplementary Table 8. Table 7. Are those data from the validation cohort and why median survival times are not available for many clusters? It would be helpful to provide patient numbers (n) for each cluster and present the data for training set in parallel.

It is not completely clear what is shown in Supplementary
Reply: We thank the reviewer for the comment. Supplementary Table 7 shows the median OS for the validation cohort stratified by the molecular clusters 1-14. We added the total number in each cluster to the table. The median survival was not reported for some clusters due to the small number of patients. 5. It was also not clear to the reviewer why HR values for groups 2 and 3 in comparison to group 1 (low risk) are below 1 (Supplementary Table 9). There are also very big CIs despite very low p-values even for training set.

Supplementary
Reply: We thank the reviewer for this comment. In the previous analysis, we reported the hazard ratios (HR) using log scale, we edited the table and reported the exponential scale to avoid confusion. There was a mistake regarding the upper CI as well and has been updated. We updated the results in Supplementary Table 9.